Abstract

In this paper, we address the problem of discriminative dictionary learning (DDL), 
where sparse linear representation and classification are combined in a probabilis- 
tic framework. As such, a single discriminative dictionary and linear binary clas- 
sifiers are learned jointly. By encoding sparse representation and discriminative 
classification models in a MAP setting, we propose a general optimization frame- 
work that allows for a data-driven tradeoff between faithful representation and 
accurate classification. As opposed to previous work, our learning methodology 
is capable of incorporating a diverse family of classification cost functions (includ- 
ing those used in popular boosting methods), while avoiding the need for involved 
optimization techniques. We show that DDL can be solved by a sequence of up- 
dates that make use of well-known and well-studied sparse coding and dictionary 
learning algorithms from the literature. To validate our DDL framework, we apply 
it to digit classification and face recognition and test it on standard benchmarks. 



1 Introduction 

Representation of signals as sparse linear combinations of a basis set is popular in the signal/image
processing and machine learning communities. In this representation, a sample y is described by
a linear combination x of a small number of columns in a dictionary D, such that y = Dx.
Significant theoretical progress has been made to determine the necessary and sufficient conditions
under which recovery of the sparsest representation using a predefined D is guaranteed [13] [27] [4].
Recent sparse coding methods achieve state-of-the-art results for various visual tasks, such as face
recognition [29]. Instead of minimizing the $\ell_0$ norm of x, these methods solve relaxed versions of
the originally NP-hard problem, which we will refer to as traditional sparse coding (TSC). However,
it has been empirically shown that adapting D to the underlying data can improve upon state-of-the-art
techniques in various restoration and denoising tasks [6] [23]. This adaptation is made possible by
solving a sparse matrix factorization problem, which we refer to as dictionary learning. Learning D
is done by alternating between TSC and dictionary updates [1] [8] [20] [18]. For an overview of TSC,
dictionary learning, and some of their applications, we refer the reader to [28] [7].

In this paper, we address the problem of discriminative dictionary learning (DDL), where D is
viewed as a linear mapping between the original data space and the space of sparse representations,
whose dimensionality is usually higher. In DDL, we seek an optimal mapping that yields faithful
sparse representation and allows for maximal discriminability between labeled data. These two
objectives are seldom complementary and tend to conflict in many cases;
thus, classification can be viewed as a regularizer for reliable representation and vice versa. From
both viewpoints, this regularization is important to prevent overfitting to the labeled data. Therefore,
instead of optimizing the two objectives independently, we seek joint optimization. In the case of
sparse linear representation, the problem of DDL was recently introduced and developed in [19] [21]
[22] under the name supervised dictionary learning (SDL). In this paper, we denote the problem
as DDL instead of SDL, since DDL inherently includes the semi-supervised case. SDL is also
addressed in a recent work on task-driven dictionary learning [18]. The form of the optimization
problem in SDL is shown in Eq. (1). The objective is a linear combination of a representation cost
$e_R$ and a classification cost $e_C$ using data labels L and classifier parameters W.



$$\min_{X,D,W} \; e_R(Y,X,D) + \lambda\, e_C(X,W,L) \qquad (1)$$

Although [22] [21] use multiple dictionaries, learning a single dictionary allows for
sharing of features among labeled classes, less computational cost, and less risk of overfitting. As
a result, our proposed method learns a single dictionary D. Here, we note that [13] addresses a
similar problem, where D is predefined and $e_C$ is the Fisher criterion. Despite their merits, SDL
methods have the following drawbacks. (i) Most methods use limited forms for $e_C$ (e.g. softmax
applied to reconstruction error). Consequently, they cannot generalize to incorporate popular
classification costs, such as the exponential loss used in AdaBoost or the hinge loss in SVMs. (ii) Previous
SDL methods weight the training samples and the classifiers uniformly by setting the fixed mixing
coefficient $\lambda$ according to cross-validation. This biases their cost functions to samples that are
badly represented or misclassified. As such, they are more sensitive to outlier, noisy, and mislabeled
training data. (iii) From an optimization viewpoint, the SDL objective functions are quite involved,
especially due to the use of the softmax function for multi-class discrimination.

Contributions: Our proposed DDL framework addresses the previous issues by learning a linear
map D that allows for maximal class discrimination in the labeled data when using linear
classification. (i) We show that this framework is applicable to a general family of classification cost
functions, including those used in popular boosting methods. (ii) Since we pose DDL in a probabilistic
setting, the representation-classification tradeoff and the weighting of training samples correspond
to MAP parameters that are estimated in a data-driven fashion that avoids parameter tuning. (iii)
Since we decouple $e_R$ and $e_C$, the representations X act as the only liaisons between classification
and representation. In fact, this is why well-studied methods in dictionary learning and TSC can be
easily incorporated in solving the DDL problem. This avoids involved optimization techniques. Our
framework is efficient, general, and modular, so that any improvement or theoretical guarantee on
individual modules (i.e. TSC or dictionary learning) can be seamlessly incorporated.

The paper is organized as follows. In Section 2, we describe the probabilistic representation and
classification models in our DDL framework and how they are combined in a MAP setting. Section
3 presents the learning methodology that estimates the MAP parameters and shows how inference
is done. In Section 4, we validate our framework by applying it to digit classification and face
recognition and showing that it achieves state-of-the-art performance on benchmark datasets.

2 Overview of DDL Framework 

In this section, we give a detailed description of the probabilistic models used for representation and 
classification. Our optimization framework, formulated in a standard MAP setup, seeks to maximize 
the likelihood of the given labeled data coupled with priors on the model parameters. 

2.1 Representation and Classification Models 

We assume that each M-dimensional data sample can be represented as a sparse linear combination
of K dictionary atoms with additive Gaussian noise of diagonal covariance: $y = Dx + n$; $n \sim
\mathcal{N}(0, \sigma^2 I)$. Here, we view the sparse representation x as a latent variable of the representation
model. In training, we assume that the training samples are represented by this model. However,
test samples can be contaminated by various types of noise that need not be zero-mean Gaussian
in nature. In testing, we have: $y = Dx + e + n$, where we constrain any auxiliary noise e (e.g.
occlusion) to be sparse in nature without modeling its explicit distribution. This constraint is used
in the error correction method for sparse representation in [27]. The representation
in testing is therefore identical to the one in training, except that the dictionary is augmented by the
identity. In both cases, the likelihood of observing a specific y is modeled as a Gaussian: $(y \mid x, D) \sim
\mathcal{N}(Dx, \sigma^2 I)$. Since a single dictionary is used to represent samples belonging to different classes,
sharing of features is allowed among classes, which simplifies the learning process.
To model the classification process, we assume that each data sample corresponds to a label vector
$l \in \{-1, +1\}^C$, which encodes the class membership of this sample, where C is the total number of
classes. In our experiments, only one value in $l$ is +1. We apply a linear classifier (or equivalently a
set of additively boosted linear classifiers) to the sparse representations in a one-vs-all classification
setup. The probabilistic classification model is shown in Eq. (2), where $\Omega(\cdot)$ is the classification cost
function. Note that appending 1 to x intrinsically adds a bias term to each classifier w. Due to the
linearity of the classifier, discrimination of the $j$-th class is completely determined by the scalar cost
function $\Omega(z_j)$, where $z_j = l_j w_j^T x$. This function quantifies the cost of assigning label
$l_j$ to representation x using the $j$-th classifier $w_j$. For now, we do not specify the functional form of
$\Omega(\cdot)$. In Section 3, we show that most forms of $\Omega(\cdot)$ used in practice are easily incorporated into
our DDL framework. Since we seek effective class discrimination, we expect low classification cost
for the given representations. Therefore, by arranging all C linear classifiers in matrix W, the event
$(l \mid x, W)$ can be modeled as a product of C independent exponential distributions parameterized by
$\gamma_j$ for $j = 1, \ldots, C$. By denoting $w_j$ as the classifier of the $j$-th class, we have:

$$p(l \mid x, W) = \prod_{j=1}^{C} \frac{1}{\gamma_j} \exp\left(-\frac{\Omega(l_j w_j^T x)}{\gamma_j}\right) \qquad (2)$$



2.2 Overall Probabilistic Model 

To formalize notation, we consider a training set of N data samples in $\mathbb{R}^M$ that are columns of the
data matrix $Y \in \mathbb{R}^{M \times N}$. The $i$-th column of the label matrix $L \in \{+1, -1\}^{C \times N}$ is the label vector
$l_i$ corresponding to the $i$-th data sample. Here, we assume that there are K atoms in the dictionary
$D \in \mathbb{R}^{M \times K}$, where K is a fixed integer that is application-dependent. Typically, $K > M$. Note
that there have been recent attempts to determine an optimal K for a given dataset [24]. For our
experiments, K is kept fixed and its optimization is left for future work. The representation matrix
$X \in \mathbb{R}^{K \times N}$ is a sparse matrix, whose columns represent the sparse codes of the data samples
Y using dictionary D. The linear classifiers are columns in matrix $W \in \mathbb{R}^{(K+1) \times C}$. We denote
$\Theta_R = \{\sigma_i^2\}_{i=1}^{N}$ and $\Theta_C = \{\gamma_j\}_{j=1}^{C}$ as the representation and classification parameters respectively.

In what follows, we combine the representation and classification models from the previous
section in a unified framework that will allow for the joint MAP estimation of the unknowns: D,
X, W, $\Theta_R$, and $\Theta_C$. By making the standard assumption that the posterior probability
consists of a dominant peak, we determine the required MAP estimates by maximizing the product:
$p(Y \mid D, X, \Theta_R)\, p(L \mid X, W, \Theta_C)\, p(\Theta_R)\, p(\Theta_C)$. Here, we make a simplifying assumption that the
priors of the dictionary and representations are uniform. To model the priors of $\Theta_R$ and $\Theta_C$ and
to avoid using hyper-parameters, we choose the objective non-parametric Jeffreys prior, which
has been shown to perform well for classification and regression tasks [9]. Therefore, we obtain
$p(\Theta_R) \propto \prod_{i=1}^{N} \frac{1}{\sigma_i^2}$ and $p(\Theta_C) \propto \prod_{j=1}^{C} \frac{1}{\gamma_j}$. The motivations behind the selection of these priors are
that (i) the representation prior encourages a low variance representation (i.e. the training data should
properly fit the proposed representation model) and that (ii) the classification prior encourages a low
mean (and variance)¹ classification cost (i.e. the training data should be properly classified using the
proposed classification model). By minimizing the sum of the negative log likelihoods of the data
and labels and the negative log priors, MAP estimation requires solving the optimization problem in
Eq. (3), where $L_{ji}$ represents the label of the $i$-th training sample with respect to the $j$-th class.

To encode the sparse representation model, we explicitly enforce sparsity on X by requiring that
each representation $x_i \in S_T = \{a : \|a\|_0 \le T\}$. An alternative for obtaining sparse representations
is to assume that $x_i$ follows a Laplacian prior, which leads to an $\ell_1$ regularizer in the objective. While
this sparsifying regularizer alleviates some of the complexity of Eq. (3), it leads to the problem of
selecting proper parameters for these Laplacian priors. Note that recent efforts have been made to
find optimal estimates of these Laplacian parameters in the context of sparse coding [30, 2].
However, to avoid additional parameters, we choose the form in Eq. (3), where the first two terms
of the objective correspond to the representation cost and the last two to the classification cost.



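¹ The mean and variance of an exponential distribution with parameter $\lambda = \frac{1}{\gamma}$ are $\gamma$ and $\gamma^2$ respectively.

$$\min_{D, W, \Theta_R, \Theta_C,\; x_i \in S_T} \; \sum_{i=1}^{N} \frac{\|y_i - D x_i\|_2^2}{2\sigma_i^2} + \sum_{i=1}^{N} \ln \sigma_i^{M+2} + \sum_{j=1}^{C} \sum_{i=1}^{N} \frac{\Omega(L_{ji} w_j^T x_i)}{\gamma_j} + \sum_{j=1}^{C} \ln \gamma_j^{N+1} \qquad (3)$$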



In the following section, we show that Eq. (3) can be solved for a general family of cost functions
$\Omega(\cdot)$ using well-known and well-studied techniques in TSC and dictionary learning. In other words,
developing specialized optimization methods and performing parameter tuning are not required.
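To make the structure of the MAP objective concrete, the following sketch evaluates its representation and classification parts separately. It assumes the objective has the form $\sum_i \|y_i - D x_i\|_2^2 / (2\sigma_i^2) + (M+2) \ln \sigma_i$ plus $\sum_j \sum_i \Omega(L_{ji} w_j^T x_i)/\gamma_j + (N+1) \sum_j \ln \gamma_j$; the exponents, the function name `ddl_objective`, and the argument layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ddl_objective(Y, D, X, W, L, sigma, gamma, loss):
    """Evaluate an Eq. (3)-style cost (hypothetical helper).

    Y: (M, N) data, D: (M, K) dictionary, X: (K, N) sparse codes,
    W: (K+1, C) classifiers (last row acts as the bias),
    L: (C, N) labels in {-1, +1}, sigma: (N,) per-sample std. dev.,
    gamma: (C,) per-class cost scale, loss: elementwise cost Omega(z).
    """
    M, N = Y.shape
    # Representation part: Gaussian negative log-likelihood plus prior term.
    resid = np.sum((Y - D @ X) ** 2, axis=0)
    rep = np.sum(resid / (2.0 * sigma ** 2) + (M + 2) * np.log(sigma))
    # Classification part: exponential negative log-likelihood plus prior term.
    X1 = np.vstack([X, np.ones((1, N))])   # append 1 to each code for the bias
    Z = L * (W.T @ X1)                     # z_ji = L_ji * w_j^T x_i
    cls = np.sum(loss(Z) / gamma[:, None]) + (N + 1) * np.sum(np.log(gamma))
    return rep, cls
```

Returning the two parts separately makes it easy to monitor how the data-driven parameters $\sigma_i$ and $\gamma_j$ shift the balance between representation and classification during learning.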



3 Learning Methodology 

Since the objective function and sparsity constraints in Eq. (3) are non-convex, we decouple the
dependent variables by resorting to a blockwise coordinate descent method (alternating optimization).
At each iteration, only a subset of variables is updated. Clearly, learning D is decoupled
from learning W, if X and $(\Theta_R, \Theta_C)$ are fixed. Next, we identify the four basic update procedures
in our DDL framework. In what follows, we denote the estimate of variable A at iteration k as $A^{(k)}$.



3.1 Classifier Update 



Since the classification terms in Eq. (3) are decoupled from the representation terms and
independent of each other, each classifier can be learned separately. In this paper, we focus on
four popular forms of $\Omega(z)$, as shown in Figure 1(a): (i) the square loss: $\Omega(z) = (1 - z)^2$
optimized by the boosted square leverage method [5], (ii) the exponential loss: $\Omega(z) = e^{-z}$
optimized by the AdaBoost method [10], (iii) the logistic loss: $\Omega(z) = \ln(1 + e^{-z})$ optimized
by the LogitBoost method [10], and (iv) the hinge loss: $\Omega(z) = \max(0, 1 - z)$ optimized
by the SVM method. Since additive boosting of linear classifiers yields a linear classifier,
we allow for seamless incorporation of additive boosting, which is a novel contribution.
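For reference, the four costs and their first and second derivatives with respect to $z$ can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and the hinge loss is given with a subgradient and a zero second derivative, which is why a strictly convex surrogate is needed whenever second derivatives are used.

```python
import numpy as np

def square(z):
    """Square loss (1 - z)^2 with its first and second derivatives."""
    return (1 - z) ** 2, -2 * (1 - z), 2 * np.ones_like(z)

def exponential(z):
    """Exponential loss e^{-z}; the second derivative equals the loss."""
    e = np.exp(-z)
    return e, -e, e

def logistic(z):
    """Logistic loss ln(1 + e^{-z})."""
    s = 1.0 / (1.0 + np.exp(z))            # equals e^{-z} / (1 + e^{-z})
    return np.log1p(np.exp(-z)), -s, s * (1 - s)

def hinge(z):
    """Hinge loss max(0, 1 - z); subgradient, second derivative is zero."""
    return np.maximum(0.0, 1 - z), np.where(z < 1, -1.0, 0.0), np.zeros_like(z)
```

Each function takes a NumPy array of margins $z = l\, w^T x$ and returns a tuple (cost, first derivative, second derivative).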



Figure 1: Four classification cost functions (square, exponential, logistic, and hinge loss) are shown
in 1(a). 1(b) plots their impacts on classifier weights (second derivatives) in our DDL framework.
Panels: (a) classification cost, (b) classifier weights.



3.2 Discriminative Sparse Coding 

In this section, we describe how well-known and well-studied TSC algorithms (e.g. Orthogonal
Matching Pursuit (OMP)) are used to update $X^{(k+1)}$ from $X^{(k)}$. This is done by solving the problem
in Eq. (4), which we refer to as discriminative sparse coding (DSC). DSC requires the sparse code to
not only reliably represent the data sample but also to be discriminable by the one-vs-all classifiers.
Here, we denote $l$ as the label vector of the $i$-th data element (i.e. the $i$-th column of L). The (k)
superscripts are omitted from variables not being updated to facilitate readability. Here, we note that
DSC, as defined here, is a generalization of the functional form used in [13].



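$$x_i^{(k+1)} = \arg\min_{x \in S_T} \; \|b - Ax\|_2^2 + \sum_{j=1}^{C} \frac{\Omega(l_j w_j^T x)}{\gamma_j}, \quad \text{where } b = \frac{y_i}{\sqrt{2}\sigma_i};\; A = \frac{D}{\sqrt{2}\sigma_i} \qquad (4)$$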



Solving Eq. (4): The complexity of this solution depends on the nature of $\Omega(\cdot)$. However, it is
easy to show that, by applying a projected Newton gradient descent method to Eq. (4), DSC can be
formulated as a sequence of TSC problems, if $\Omega(z)$ is strictly convex. At each Newton iteration, a
quadratic local approximation of the cost function is minimized. If we denote $\Omega_1(z)$ and $\Omega_2(z)$ as the
first and second derivatives of $\Omega(z)$ respectively and $\Omega_{1|2}(z) = \frac{\Omega_1(z)}{\Omega_2(z)}$, the quadratic approximation
of $\Omega(z)$ around $z_p$ is $\Omega(z) \approx \Omega(z_p) + \Omega_1(z_p)(z - z_p) + \frac{1}{2}\Omega_2(z_p)(z - z_p)^2$. Since $\Omega_2(z)$ is a strictly
positive function, we can complete the square to get $\Omega(z) \approx \frac{1}{2}\Omega_2(z_p)\left[z - (z_p - \Omega_{1|2}(z_p))\right]^2 + \text{const}$. By
replacing this approximation in Eq. (4), the objective function at the $(p+1)$-th Newton iteration is:
$\|b - Ax\|_2^2 + \|H^{(p)}(\delta^{(p)} - G^T x)\|_2^2$. In fact, this objective takes the form of a TSC problem and, thus,
can be solved by any TSC algorithm. Here, G is formed by the columnwise concatenation of $g_j = l_j w_j$ and
we define $\delta^{(p)}(j) = g_j^T x^{(p)} - \Omega_{1|2}(g_j^T x^{(p)})$ for $j = 1, \ldots, C$. Also, we define the diagonal weight
matrix $H^{(p)}$, where $H^{(p)}(j,j) = \sqrt{\frac{\Omega_2(g_j^T x^{(p)})}{2\gamma_j}}$ weights the $j$-th classifier. Based on this derivation,
the same TSC algorithm (e.g. OMP) can be used to solve the DSC problem iteratively, as illustrated
in Algorithm 1. The convergence of this algorithm is dependent on whether the TSC algorithm
is capable of recovering the sparsest solution at each iteration. Although this is not guaranteed in
general, the convergence of TSC algorithms to the sparsest solution has been shown to hold when
the solution is sparse enough, even if the dictionary atoms are highly correlated [13] [22] [12] [4]. In
our experiments, we see that the DSC objective is reduced sequentially and convergence is obtained
in almost all cases. Furthermore, we provide a Stop Criterion (threshold on the relative change in
solution) for the premature termination of Algorithm 1 to avoid needless computation.



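Algorithm 1 Discriminative Sparse Coding (DSC)

INPUT: A, b, G, $\gamma$, $\Omega$, $x^{(0)}$, T, $p_{max}$, Stop Criterion
while (Stop Criterion) AND $p < p_{max}$ do
    compute and form: $\delta^{(p)}$ and $H^{(p)}$;
    $x^{(p+1)} = \mathrm{TSC}\left( \begin{bmatrix} b \\ H^{(p)}\delta^{(p)} \end{bmatrix}, \begin{bmatrix} A \\ H^{(p)}G^T \end{bmatrix}, T \right)$; $p = p + 1$;
end while
OUTPUT: $x^{(p)}$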
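The iteration in Algorithm 1 can be sketched in a few lines of NumPy. The code below is a minimal illustration under stated assumptions, not the authors' implementation: `omp` is a plain (non-batch) orthogonal matching pursuit written for this example, `dsc` and `loss_derivs` are hypothetical names, the classifier bias term is omitted for brevity, and the loss is assumed strictly convex so that its second derivative is positive.

```python
import numpy as np

def omp(A, b, T):
    """Greedy TSC: approximately minimize ||b - A x||_2 s.t. ||x||_0 <= T."""
    x = np.zeros(A.shape[1])
    support, r = [], b.copy()
    norms = np.linalg.norm(A, axis=0) + 1e-12
    for _ in range(T):
        if np.linalg.norm(r) < 1e-10:
            break
        k = int(np.argmax(np.abs(A.T @ r) / norms))   # best-correlated atom
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        r = b - A[:, support] @ coef
    if support:
        x[support] = coef
    return x

def dsc(A, b, G, gamma, loss_derivs, x0, T, p_max=100, tol=1e-6):
    """Newton-style DSC sketch. G is (K, C) with columns g_j = l_j * w_j;
    loss_derivs(z) returns (Omega_1(z), Omega_2(z)) elementwise."""
    x = x0.copy()
    for _ in range(p_max):
        z = G.T @ x
        d1, d2 = loss_derivs(z)
        delta = z - d1 / d2                  # Newton target per classifier
        h = np.sqrt(d2 / (2.0 * gamma))      # diagonal of the weight matrix H
        b_aug = np.concatenate([b, h * delta])
        A_aug = np.vstack([A, h[:, None] * G.T])
        x_new = omp(A_aug, b_aug, T)
        # Stop when the relative change in the solution is small.
        done = np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1e-12)
        x = x_new
        if done:
            break
    return x
```

With the square loss, `delta` and `h` do not depend on the current iterate, so the second pass only confirms that the solution has stopped changing.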



Popular Forms of $\Omega(z)$: Here, we focus on particular forms of $\Omega(z)$, namely the four functions in
Section 3.1. Before proceeding, we need to replace the traditional hinge cost with a strictly convex
approximation. We use the smooth hinge approximation introduced by [17], which can arbitrarily
approximate the traditional hinge. As seen before, $\Omega_2(z)$ and $\Omega_{1|2}(z)$ are the only functions that play
a role in the DSC solution. Obviously, only one iteration of Algorithm 1 is needed when the square
cost is used, since it is already quadratic. For all other $\Omega(z)$, at the $p$-th iteration of DSC, the impact
of the $j$-th classifier on the overall cost (or equivalently on updating the sparse code) is determined
by $H^{(p)}(j,j)$. This weight is influenced by two terms. (i) It is inversely proportional to $\gamma_j$. So,
a classifier with a smaller mean training cost (i.e. higher training set discriminability) yields more
impact on the solution. (ii) It is proportional to $\Omega_2(l_j w_j^T x^{(p)})$, the second derivative at the previous
solution. In this case, the impact of the classifier is determined by the type of classification cost
used. In Figure 1(b), we plot the relationship between $\Omega(z)$ and $\Omega_2(z)$ for all four types. For
the square and hinge functions, $\Omega(z)$ and $\Omega_2(z)$ are independent; thus, a classifier yielding high
sample discriminability (low $\Omega(z)$) is weighted the same as one yielding low discriminability. For
the exponential case, the relationship is linear and positively correlated; thus, the lower a classifier's
sample discriminability is, the higher its weight. This implies that the sparse code will be updated
to correct for classifiers that misclassified the training sample in the previous iteration. Clearly,
this makes representation sensitive to samples that are "hard" to classify as well as outliers. This
sensitivity is overcome when the logistic cost is used. Here, the relationship is positively correlated
for moderate costs but negatively correlated for high costs. This is consistent with the theoretical
argument that LogitBoost should outperform AdaBoost when training data is noisy or mislabeled.
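The relationships described here for the exponential and logistic costs can be checked with a short derivation. The sketch below assumes the forms $\Omega(z) = e^{-z}$ and $\Omega(z) = \ln(1 + e^{-z})$ and expresses the second derivative $\Omega_2$ as a function of the cost value $\Omega$ itself.

```latex
\begin{align*}
\text{Exponential:}\quad & \Omega(z) = e^{-z}
  \;\Rightarrow\; \Omega_2(z) = e^{-z} = \Omega(z) \\
\text{Logistic:}\quad & \Omega(z) = \ln(1 + e^{-z})
  \;\Rightarrow\; e^{-\Omega} = \frac{1}{1 + e^{-z}}, \quad
  \Omega_2(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = e^{-\Omega}\left(1 - e^{-\Omega}\right) \\
& \frac{d\Omega_2}{d\Omega} = e^{-\Omega}\left(2e^{-\Omega} - 1\right)
  \begin{cases} > 0, & \Omega < \ln 2 \\ < 0, & \Omega > \ln 2 \end{cases}
\end{align*}
```

Under these assumptions, the logistic weight grows with the cost only up to $\Omega = \ln 2$ (i.e. $z = 0$) and decays beyond it, so badly misclassified samples receive progressively less weight, whereas the exponential weight grows without bound.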



3.3 Unsupervised Dictionary Learning 

When $X^{(k)}$, $\Theta_R^{(k)}$, and $\Theta_C^{(k)}$ are fixed, $D^{(k)}$ can be updated by any unsupervised dictionary learning
method. In our experiments, we use the KSVD algorithm, since it avoids expensive matrix inversion
operations required by other methods. Also, efficient versions of KSVD have recently been
developed [25]. By alternating between TSC and dictionary updates (SVD operations), KSVD iteratively
reduces the overall representation cost and generates a dictionary with normalized atoms and the
corresponding sparse representations. In our case, the representations are known a priori, so only a
single iteration of the KSVD algorithm is required. For more details, we refer the readers to [1].
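A single pass of KSVD-style atom updates with a known sparsity pattern can be sketched as follows. This is a simplified illustration with a hypothetical function name (`ksvd_sweep`); it ignores the per-sample weighting by $\sigma_i$ and is not the efficient implementation referenced above.

```python
import numpy as np

def ksvd_sweep(Y, D, X):
    """One pass of rank-1 atom updates; the support of X is kept fixed."""
    D, X = D.copy(), X.copy()
    for k in range(D.shape[1]):
        users = np.flatnonzero(X[k])           # samples that use atom k
        if users.size == 0:
            continue                           # unused atom: leave unchanged
        # Residual with atom k's own contribution added back in,
        # restricted to the samples that use this atom.
        E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]                      # updated unit-norm atom
        X[k, users] = s[0] * Vt[0]             # matching coefficients
    return D, X
```

Because each update is the best rank-1 fit to the restricted residual, the total representation error cannot increase during the sweep.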
3.4 Parameter Estimation and Initialization



The use of the Jeffreys prior for $\Theta_R$ and $\Theta_C$ yields simple update equations: $\sigma_i^{2\,(k)} =
\frac{\|y_i - D^{(k)} x_i^{(k)}\|_2^2}{M+2}$ and $\gamma_j^{(k)} = \frac{1}{N+1} \sum_{i=1}^{N} \Omega\left(L_{ji}\, w_j^{(k)T} x_i^{(k)}\right)$. These variables estimate the sample
representation variance and the mean/variance of the classification cost respectively. Since the
overall update scheme is iterative, proper initialization is needed. In our experiments, we initialize $D^{(0)}$
to a randomly selected subset of training samples (uniformly chosen from the different classes) or
to random zero-mean Gaussian vectors, followed by columnwise normalization. Interestingly, both
schemes produce similar dictionaries, although the randomized scheme requires more iterations for
convergence. The representations $X^{(0)}$ are computed by TSC using $D^{(0)}$. Initializing the remaining
variables uses the update schemes above. Algorithm 2 summarizes the overall DDL framework.
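The closed-form parameter updates amount to averaging residuals and classification costs. The sketch below assumes the updates $\sigma_i^2 = \|y_i - D x_i\|_2^2 / (M+2)$ and $\gamma_j = \sum_i \Omega(L_{ji} w_j^T x_i) / (N+1)$; the denominators and the function name `update_parameters` are assumptions made for illustration.

```python
import numpy as np

def update_parameters(Y, D, X, W, L, loss):
    """Return per-sample variances and per-class cost scales (sketch).

    Y: (M, N), D: (M, K), X: (K, N), W: (K+1, C) with a bias row,
    L: (C, N) labels in {-1, +1}, loss: elementwise cost Omega(z).
    """
    M, N = Y.shape
    # Per-sample representation variance from the reconstruction residual.
    sigma2 = np.sum((Y - D @ X) ** 2, axis=0) / (M + 2)
    # Per-class scale from the accumulated classification cost.
    X1 = np.vstack([X, np.ones((1, N))])
    gamma = np.sum(loss(L * (W.T @ X1)), axis=1) / (N + 1)
    return sigma2, gamma
```

In practice a small positive floor on `sigma2` and `gamma` may be needed to avoid division by zero in later steps; that safeguard is a suggestion, not part of the source.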



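Algorithm 2 Discriminative Dictionary Learning (DDL)

INPUT: Y, L, T, $\Omega$, $q_{max}$, $p_{max}$, Stop Criterion
Initialize $D^{(0)}$, $X^{(0)}$, $\Theta_R^{(0)}$, $\Theta_C^{(0)}$, and $q = 0$
while (Stop Criterion) AND $q < q_{max}$ do
    for i = 1 to N do
        $x_i^{(q+1)} = \mathrm{DSC}\left( \frac{D^{(q)}}{\sqrt{2}\sigma_i^{(q)}}, \frac{y_i}{\sqrt{2}\sigma_i^{(q)}}, W^{(q)}\mathrm{diag}(l_i), \gamma^{(q)}, \Omega, x_i^{(q)}, T, p_{max}, \text{Stop Criterion} \right)$;
    end for
    Learn classifiers $W^{(q+1)}$ using L and $X^{(q+1)}$;
    $D^{(q+1)} = \mathrm{KSVD}(D^{(q)}, X^{(q+1)}, T)$;
    Update $\sigma^{(q+1)}$ and $\gamma^{(q+1)}$; $q = q + 1$;
end while
OUTPUT: $D^{(q)}$, $W^{(q)}$, $X^{(q)}$, and $\gamma^{(q)}$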



3.5 Inference 

After learning D and W, we describe how the label of a test sample $y_t$ is inferred. We seek the
class $j_t$ that maximizes $p(y_t \mid l_t(j))$, where $l_t(j)$ is the label vector of $y_t$ assuming it belongs to class
j. By marginalizing with respect to x and assuming a single dominant representation $x_t$ exists,
$j_t$ is the class that maximizes $p(y_t \mid x_t, D)\, p(x_t \mid l_t(j), W)$, as in Eq. (5). The inner maximization
problem is exactly a DSC problem where $l_t(j)$ is the hypothesized label vector. Here, we use
the testing representation model to account for dense errors (e.g. occlusion), thus augmenting
D by identity. Computing $j_t$ involves C independent DSC problems. To reduce computational
cost, we solve a single TSC problem instead: $x_t = \arg\max_{x \in S_T} p(y_t \mid x, D)$. In this case,
$j_t = \arg\max_{j \in 1, \ldots, C} p(l_t(j) \mid x_t, W)$.

$$j_t = \arg\max_{j \in 1, \ldots, C} \left( \max_{x \in S_T} p(y_t \mid x, D)\, p(l_t(j) \mid x, W) \right) \qquad (5)$$
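The reduced-cost inference rule can be sketched as follows: once a sparse code $x_t$ has been obtained by TSC, maximizing $p(l_t(j) \mid x_t, W)$ under the exponential model is equivalent to minimizing the summed cost $\sum_c \Omega(l_t(j)_c\, w_c^T x_t)/\gamma_c$ over the hypothesized class j. The function name `infer_class` and the argument layout are hypothetical.

```python
import numpy as np

def infer_class(x_t, W, gamma, loss):
    """Pick the class whose one-vs-all label vector has the lowest cost.

    x_t: (K,) sparse code of the test sample, W: (K+1, C) classifiers with
    a bias row, gamma: (C,) cost scales, loss: elementwise cost Omega(z).
    """
    C = W.shape[1]
    scores = W.T @ np.append(x_t, 1.0)     # w_c^T x_t including the bias
    costs = np.empty(C)
    for j in range(C):
        l = -np.ones(C)
        l[j] = 1.0                         # hypothesized label vector l_t(j)
        costs[j] = np.sum(loss(l * scores) / gamma)
    return int(np.argmin(costs))
```

The normalizing factors $1/\gamma_c$ of the exponential densities are the same for every hypothesis, so they can be dropped from the comparison.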

Implementation Details: There are several ways to speed up computation and allow for quicker
convergence. (i) The DSC update step is the most computationally expensive operation in
Algorithm 2. This is mitigated by using a greedy TSC method (Batch-OMP instead of $\ell_1$ minimization
methods) and exploiting the inherent parallelism of DDL (e.g. doing DSC updates in parallel).
(ii) Selecting suitable initializations for D and the DSC solutions can dramatically speed up
convergence. For example, choosing $D^{(0)}$ from the training set leads to a smaller number of DDL
iterations than randomly choosing $D^{(0)}$. Also, we initialize DSC solutions at a given DDL iteration
with those from the previous iteration. Moreover, the DDL framework is easily extended to the
semi-supervised case, where only a subset of training samples are labeled. The only modification to
be made here is to use TSC (instead of DSC) to update the representations of unlabeled samples.
4 Experimental Results



In this section, we provide empirical analysis of our DDL framework when applied to handwritten
digit classification (C = 10) and face recognition (C = 38). Digit classification is a standard
machine learning task with two popular benchmarks, the USPS and MNIST datasets. The digit
samples in these two datasets have been acquired under different conditions or written using significantly
different handwriting styles. To alleviate this problem, we use the alignment and error correction
technique for TSC that was introduced in [26]. This corrects for gross errors that might occur (e.g.
due to thickening of handwritten strokes or reasonable rotation/translation). Consequently, we do
not need to augment the training set with shifted versions of the training images, as done in [18].
Furthermore, we apply DDL to face recognition, which is a machine vision problem where sparse
representation has made a big impact. We use the Extended Yale B (E-YALE-B) benchmark for
evaluation. To show that learning D in a discriminative fashion improves upon traditional
dictionary learning, we compare our method against a baseline that treats representation and classification
independently. In the baseline, X and D are estimated using KSVD, W is learned using X and L
directly, and a winner-take-all classification strategy is used. Clearly, our framework is general,
so we do not expect to outperform methods that use domain-specific features and machinery.
However, we do achieve results comparable to state-of-the-art. Also, we show that our DDL framework
significantly outperforms the baseline. In all our experiments, we set $q_{max} = 20$ and $p_{max} = 100$ and
initialize D to elements in the training set.

Digit Classification: The USPS dataset comprises N = 7291 training and 2007 test images, each
of 16 × 16 pixels (M = 256). We plot the test error rates of the baseline for the four classifier types
and for a range of T and K values in Figure 2. Beneath each plot, we indicate the values of K
and T that yield minimum error. This is a common way of reporting SDL results [18] [19] [21] [22].
Interestingly, the square loss classifier leads to the lowest error and the best generalization. For
comparison, we plot the results of our DDL method in Figure 3. Clearly, our method achieves a
significant improvement of 4.5% over the baseline, and 1% and 0.5% over the SDL methods in [19]
and [18] respectively. Our results are comparable to the state-of-the-art performance (2.2% [16]).
This result shows that adapting D to the underlying data and class labels yields a dictionary that is
better suited for classification. Increasing T leads to an overall improvement of performance because
representation becomes more reliable. However, we observe that beyond T = 3, this improvement
is insignificant. The square loss classifier achieves the lowest performance and the logistic classifier
achieves the highest. The variations of error with K are similar for all the classifiers. Error steadily
decreases until an "optimal" K value is reached. Beyond this K value, performance deteriorates due
to overfitting. Future work will study how to automatically predict this optimal value from training
data, without resorting to cross-validation.

In Figure [?], we plot the learned parameters $\Theta_R$ (in histogram form) and $\Theta_C$ for a typical DDL setup.
We observe that the form of these plots does not significantly change when the training setting is
changed. We notice that the $\Theta_R$ histogram fits the form of the Jeffreys prior, $p(x) \propto \frac{1}{x}$. Most of the
$\sigma$ values are close to zero, which indicates reliable reconstruction of the data. On the other hand, $\Theta_C$
take on similar values for most classes, except the "0" digit class that contains a significant amount
of variation and thus the highest classification cost. Note that these values tend to be inversely
proportional to the classification performance of their corresponding linear classifiers. We provide
a visualization of the learned D in the supplementary material. Interestingly, we observe that the
dictionary atoms resemble digits in the training set and that the number of atoms that resemble a
particular class is inversely proportional to the accuracy of that class's binary classifier. This occurs
because a "hard" class contains more intra-class variations requiring more atoms for representation.

The MNIST dataset comprises N = 60000 training and 10000 test images, each of 28 × 28 pixels
(M = 784). We show the baseline and DDL test error rates in Table 1. We train each classifier
type using the K and T values that achieved minimum error for that classifier on the USPS dataset.
Compared to the baseline, we observe a similar improvement in performance as in the USPS case.
Also, our results are comparable to state-of-the-art performance (0.53%) for this dataset [14].

Face Recognition: The E-YALE-B dataset comprises 2,414 images of C = 38 individuals, each
of 192 × 168 pixels, which we downsample by a factor of 8 (M = 504). Using a classification
setup similar to [29] with K = 600 and T = 5, we record the classification results in Table 1,
which lead to implications similar to those in our previous experiments. Interestingly, DDL achieves
similar results to the robust sparse representation method of [29], which uses all training samples
(K ≈ 1200) as atoms in D. This shows that learning a discriminative D can reduce the dictionary
size by as much as 50%, without significant loss in performance.

Figure 2: Baseline classification performance on the USPS dataset (recoverable panel titles: baseline error rate vs. K for the square and exponential losses)

Figure 3: DDL classification performance on the USPS dataset



Table 1: Baseline and DDL test error on MNIST and E-YALE-B datasets

            MNIST (digit classification)       E-YALE-B (face recognition)
            SQ      EXP     LOG     HINGE      SQ      EXP     LOG     HINGE
BASELINE    8.35%   6.91%   5.77%   4.92%      10.23%  9.65%   9.23%   9.17%
DDL         1.41%   1.28%   1.01%   0.72%      8.89%   7.82%   7.57%   7.30%



5 Conclusions 

This paper addresses the problem of discriminative dictionary learning by jointly learning a sparse 
linear representation model and a linear classification model in a MAP setting. We develop an 
optimization framework that is capable of incorporating a diverse family of popular classification 
cost functions and solvable by a sequence of update operations that build on well-known and well- 
studied methods in sparse representation and dictionary learning. Experiments on standard datasets 
show that this framework outperforms the baseline and achieves state-of-the-art performance.